(Zhu et al. 2008) (Xie 2023) (Robinson, McCarthy, and Smyth 2010) (Morgan 2022) (Durinck et al. 2009) (Gu et al. 2014) (Gu, Eils, and Schlesner 2016) (Wickham 2016) (Slowikowski 2023) (Lieberman et al. 2020) (Raudvere et al. 2019)
if (!requireNamespace("GEOmetadb", quietly = TRUE))
BiocManager::install("GEOmetadb")
if (!requireNamespace("knitr", quietly = TRUE))
install.packages("knitr")
if (!require("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("biomaRt", quietly = TRUE))
BiocManager::install("biomaRt")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
BiocManager::install("circlize")
if (!requireNamespace("ggplot2", quietly = TRUE))
install.packages("ggplot2")
if (!requireNamespace("ggrepel", quietly = TRUE))
install.packages("ggrepel")
library(BiocManager)
library(GEOmetadb)
library(knitr)
library(edgeR)
library(biomaRt)
library(ComplexHeatmap)
library(circlize)
library(ggplot2)
library(ggrepel)
On 2023/03/07 asked in lecture if I could randomly sample from data to reduce computation load. I was recommended that I should try and use the whole data. Tried to use whole data set but it was too large and would crash my R, and R studio. On 2023/03/14 posted in discussion board if I could random sample to reduce the load on the computer and reduce runtime errors. Did not get a reply by 2023/03/14 midnight so will be going with random sampling. After reducing the load did not get any crashes or errors I have gotten before with a large file.
sfiles = getGEOSuppFiles('GSE152075')
fnames = rownames(sfiles)
# there is only one supplemental file
readData = read.table(fnames[1],header=TRUE, check.names = TRUE)
Add 1 to all values of data so later on when conducting log2(cpm) we can avoid negative infinity values. (Advised by Professor Isserlin)
readData <- readData + 1
Setting first column as gene id for future format purposes
#Place rownames in first column for future format purposes
inter <- data.frame("HUGO" = rownames(readData))
geneData <- cbind(inter$HUGO, readData)
colnames(geneData)[1] <- "HUGO"
Remove any outliers that does not have at least 2 read per million in n of the samples. We set this as 2 since we add 1 to all of our dataset in the beginning of the code to have better plots. Denoting n as the smallest group of replicates which is the control group of 53. Using n = 53 conduct the removal of low counts.
#translate out counts into counts per millison using
#the edgeR package function cpm
cpms = cpm(geneData[,2:485])
rownames(cpms) <- geneData[,1]
# get rid of low counts
keep = rowSums(cpms >2) >=53
geneData_exp_filtered = geneData[keep,]
Remove version numbers if they exists on gene id(HUGO) column. This makes it easier for mapping later on.
geneData_exp_filtered[,1] <- gsub("\\.[0-9]", "", geneData_exp_filtered[,1])
#Mapping the name using biomatr
# list available gene annotation databases
bio <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
conversion_stash <- "geneMapping.rds"
if(file.exists(conversion_stash)){
geneMapping <- readRDS(conversion_stash)
} else{
# convert column of gene IDs to Hugo symbols
geneMapping <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
mart = bio,
filters = "hgnc_symbol",
values = geneData_exp_filtered[,1])
saveRDS(geneMapping, conversion_stash)
}
Combine the mapped gene data to original data
#Merge the data
mergedData <- merge(geneData_exp_filtered, geneMapping, by.x = 1, by.y = 2)
#remove duplicate rows in the gene data
mergedDataNoDup <- mergedData[!duplicated(mergedData[,1:485]),]
Randomly sample data to reduce the size of sample. Original sample is too large leading to computation errors due to the limitation of author’s computer.
set.seed(12345)
randomSamplePOS <- sample(mergedDataNoDup[2:431], 25)
randomSampleNEG <- sample(mergedDataNoDup[432:485], 25)
randomSample <- cbind(randomSamplePOS,randomSampleNEG, mergedDataNoDup$ensembl_gene_id, mergedDataNoDup$HUGO)
Define groups to use in normalization
samples <- data.frame(lapply(colnames(randomSample[1:50]),
FUN=function(x){unlist(strsplit(x,
split = "_"))[c(2,1)]}))
colnames(samples) <- colnames(randomSample[1:50])
rownames(samples) <- c("patients","cell_type")
samples <- data.frame(t(samples))
Applying TMM to data
filtered_data_matrix <- as.matrix(randomSample[1:50])
rownames(filtered_data_matrix) <- randomSample$`mergedDataNoDup$ensembl_gene_id`
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d = calcNormFactors(d)
normalized_counts <- cpm(d)
#add columns of ensembl and hgnc id
normalized_count_data = data.frame(normalized_counts)
normalized_count_data$ensembl_gene_id <- mergedDataNoDup$ensembl_gene_id
normalized_count_data$hgnc_symbol <- mergedDataNoDup$HUGO
#This is a duplicate ensembl id that is giving errors when running code.
normalized_count_data <- normalized_count_data[-c(1902),]
plotMDS(d, labels=rownames(samples),
col = c("darkgreen","blue")[factor(samples$cell_type)])
Figure A: MDS plot from Assignment 1 # Dispersion
model_design <- model.matrix(~samples$cell_type+0)
d <- estimateDisp(d, model_design)
Graphing the BCV
plotBCV(d,col.tagwise = "black",col.common = "red",)
Figure B: BCV plot showing the biological coefficient of variation of samples
plotMeanVar(d, show.raw.vars = TRUE,
show.tagwise.vars=TRUE, NBline=TRUE,
show.ave.raw.vars = TRUE,show.binned.common.disp.vars = TRUE)
Figure C: Mean variance plot of samples # Differential Gene Expression
Answer the following questions. 1. Calculate p-values for each of the genes in your expression set. How many genes were significantly differential expressed? What thresholds did you use and why? 2. Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why? How many genes passed correction? 3. Show the amount of differential expressed genes using an MA Plot or a Volcano plot. Highlight genes of interest. 4. Visualize your top hits using a heat map. Do you conditions cluster together? Explain why or why not.
p-value calculation using LIMMA
model_design <- model.matrix(~ samples$cell_type )
Not taking into account patient variability
expressionMatrix <- as.matrix(normalized_count_data[,1:50])
rownames(expressionMatrix) <-
normalized_count_data$ensembl_gene_id
colnames(expressionMatrix) <-
colnames(normalized_count_data)[1:50]
minimalSet <- ExpressionSet(assayData=expressionMatrix)
#Fit our data to model
fit <- lmFit(minimalSet, model_design)
fit2 <- eBayes(fit,trend=TRUE)
topfit <- topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits <- merge(normalized_count_data[,51:52],
topfit,
by.y=0,by.x=1,
all.y=TRUE)
#sort by pvalue
output_hits <- output_hits[order(output_hits$P.Value),]
#number of genes that passed the p-value threshold
#low values not in our data but that could result means the variation in biological variation in patients is different
length(which(output_hits$P.Value < 0.05))
## [1] 2892
#How many genes passed correction
length(which(output_hits$adj.P.Val < 0.05))
## [1] 473
Taking into account Patient variability
model_design_pat <- model.matrix(
~ samples$patients + samples$cell_type)
fit_pat <- lmFit(minimalSet, model_design_pat)
fit2_pat <- eBayes(fit_pat,trend=TRUE)
topfit_pat <- topTable(fit2_pat,
coef=ncol(model_design_pat),
adjust.method = "BH",
number = nrow(expressionMatrix))
#merge hgnc names to topfit table
output_hits_pat <- merge(normalized_count_data[,51:52],
topfit_pat,by.y=0,by.x=1,all.y=TRUE)
#sort by pvalue
output_hits_pat <- output_hits_pat[order(output_hits_pat$P.Value),]
length(which(output_hits_pat$P.Value < 0.05))
## [1] 1325
length(which(output_hits_pat$adj.P.Val < 0.05))
## [1] 0
d = DGEList(counts=filtered_data_matrix, group=samples$cell_type)
d <- estimateDisp(d, model_design_pat)
fit <- glmQLFit(d, model_design_pat)
qlf.pos_vs_neg <- glmQLFTest(fit, coef='samples$cell_typePOS')
kable(topTags(qlf.pos_vs_neg), type="html",row.names = FALSE)
|
|
|
|
P-values were corrected using Quasilikelihood method. Quasilikelihood is better becacuse it is tailored towards RNAseq data
qlf_output_hits <- topTags(qlf.pos_vs_neg,sort.by = "PValue",
n = nrow(normalized_count_data))
length(which(qlf_output_hits$table$PValue < 0.05))
## [1] 1569
length(which(qlf_output_hits$table$FDR < 0.05))
## [1] 276
LIMMA
1.Genes Significantly expressed(Not taking into account patient variability): 2892
Genes Significantly expressed(Taking into account patient variability): 1325
Threshold: 0.05; Used a globally accepted threshold value good for general purposed even in RNASEQ
Genes passed correction (Taking into account patient variability): 0
QLF 1.Genes Significantly expressed: 1569
Threshold: 0.05; Used a globally accepted threshold value good for general purposed even in RNASEQ
2.Genes passed correction: 276
Code copied from (Bonnin 2020)
volcanoData <- qlf_output_hits$table
volcanoData$geneName <- normalized_count_data$hgnc_symbol
volcanoData$regulated <- "Not significant"
volcanoData$regulated[volcanoData$PValue < 0.05 & volcanoData$logFC > 0] <- "Up regulated"
volcanoData$regulated[volcanoData$PValue < 0.05 & volcanoData$logFC < 0] <- "Down regulated"
volcanoData$vlabel <- NA
volcanoData$vlabel[volcanoData$regulated != "Not significant"] <- volcanoData$geneName[volcanoData$regulated != "Not significant"]
ggplot(data = volcanoData, aes(x = logFC, y = -log10(FDR), col = regulated, label = vlabel)) +
geom_point() +
theme_minimal() +
scale_color_manual(values=c("blue", "black", "red")) +
geom_vline(xintercept=c(-0.6, 0.6), col="red") +
geom_hline(yintercept=-log10(0.05), col="red") +
geom_text_repel()
Figure D: Volcano plot showing differential expressed genes highlighting genes that are statisically significant
heatmap_matrix <- normalized_count_data[,1:50]
rownames(heatmap_matrix) <- normalized_count_data$ensembl_gene_id
heatmap_matrix <- t(scale(t(heatmap_matrix)))
if(min(heatmap_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix), 0,
max(heatmap_matrix)), c("blue", "white", "red"))
}
current_heatmap <- Heatmap(as.matrix(heatmap_matrix),
show_row_dend = TRUE,show_column_dend = TRUE,
col=heatmap_col,show_column_names = TRUE,
show_row_names = FALSE,show_heatmap_legend = TRUE)
current_heatmap
Figure E: Heatmap of genes before conducting any differential analysis. On a scale from Blue to Red going negative to positive. ### LIMMA Heatmap
top_hits <- output_hits_pat$ensembl_gene_id[
output_hits_pat$P.Value<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[
which(rownames(heatmap_matrix) %in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)), c("blue", "white", "red"))
}
limma_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
limma_heatmap
Figure F: Heatmap of genes conducting LIMMA patient model differential gene expression analysis. On a scale from Blue to Red going negative to positive.
top_hits <- rownames(qlf_output_hits$table)[
output_hits_pat$P.Value<0.05]
heatmap_matrix_tophits <- t(
scale(t(heatmap_matrix[which(rownames(heatmap_matrix)
%in% top_hits),])))
if(min(heatmap_matrix_tophits) == 0){
heatmap_col = colorRamp2(c( 0, max(heatmap_matrix_tophits)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(heatmap_matrix_tophits), 0,
max(heatmap_matrix_tophits)),
c("blue", "white", "red"))
}
qlf_heatmap <- Heatmap(as.matrix(heatmap_matrix_tophits),
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = TRUE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
)
qlf_heatmap
Figure G: Heatmap of genes conducting QLF patient model differential gene expression analysis. On a scale from Blue to Red going negative to positive.
qlf_pat_model_pvalues <- data.frame(
ensembl_id = rownames(qlf_output_hits$table),
qlf_patient_pvalue=qlf_output_hits$table$PValue)
limma_pat_model_pvalues <- data.frame(
ensembl_id = output_hits_pat$ensembl_gene_id,
limma_patient_pvalue = output_hits_pat$P.Value)
two_models_pvalues <- merge(qlf_pat_model_pvalues,
limma_pat_model_pvalues,
by.x=1,by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue
<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$limma_patient_pvalue
<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$qlf_patient_pvalue
<0.05 &
two_models_pvalues$limma_patient_pvalue<0.05] <- "red"
plot(two_models_pvalues$qlf_patient_pvalue,
two_models_pvalues$limma_patient_pvalue,
col = two_models_pvalues$colour,
xlab = "QLF patient model p-values",
ylab ="Limma Patient model p-values",
main="QLF vs Limma")
Figure H: Visualizing p-value calculated from LIMMA and QLF. X-axis QLF patient model, Y-axis LIMMA patient model
Using QLF the NEG seem to be clustering together also.
Which ones are upregulated and downregulated
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC > 0))
## [1] 1424
length(which(qlf_output_hits$table$PValue < 0.05
& qlf_output_hits$table$logFC < 0))
## [1] 145
qlf_output_hits_withgn <- merge(randomSample[,51:52],qlf_output_hits, by.x=1, by.y = 0)
#number higher the lower the pvalue, and if it is upregulated number is positive, and negative for downregulated
qlf_output_hits_withgn[,"rank"] <- -log(qlf_output_hits_withgn$PValue,base =10) * sign(qlf_output_hits_withgn$logFC)
qlf_output_hits_withgn <- qlf_output_hits_withgn[order(qlf_output_hits_withgn$rank),]
upregulated_genes <- qlf_output_hits_withgn$`mergedDataNoDup$HUGO`[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC > 0)]
downregulated_genes <- qlf_output_hits_withgn$`mergedDataNoDup$HUGO`[
which(qlf_output_hits_withgn$PValue < 0.05
& qlf_output_hits_withgn$logFC < 0)]
write.table(x=upregulated_genes,
file=file.path("data","upregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downregulated_genes,
file=file.path("data","downregulated_genes.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=data.frame(genename= qlf_output_hits_withgn$`mergedDataNoDup$HUGO`,F_stat= qlf_output_hits_withgn$rank),
file=file.path("data","ranked_genelist.txt"),sep = "\t",
row.names = FALSE,col.names = FALSE,quote = FALSE)
Figure 1a: Upregulated Genes GO
Figure 1b: Upregulated Genes REAC
Figure 1c: Upregulated Genes WIKI
Figure 2a: Downregulated Genes GO
Figure 2b: Downregulated Genes GO
Figure 2c: Downregulated Genes GO
Using the g:Profiler since this method is updated frequently. Threshold: 0.05. Significance threshold: Benjamini-Hochberg FDR Data SourceS: GO biological process, Reactome, Wikipathways
Using the following annotations.
Version: GRCh38.p13
GO:BP : annotations: BioMart, Release 2022-12-04
Reactome:Annotations: Biomart, 2022-12-28
Wikipathways: 2022-12-28
Threshold: 0.05
Upregulated:
GO: 379, REAC: 14, WP: 35
Downregulated:
GO: 180, REAC: 61, WP: 11
ALL: Used the list from ranked_genelist.txt. Since gene set so large need to run 2000 genes at a time, or else the browser crashes. This poses a problem where the statistical power of the gene list changes as the contents are different for each 2000 set. However the most important set is the first 2000, and the last 2000. Where the pvalue is statistically significant.
GO:444 REAC:86 WP:12
GO:102 REAC:31 WP:0
GO:82 REAC:6 WP:0
GO:0 REAC:1 WP:0
GO:1 REAC:9 WP:0
GO:60 REAC:3 WP:3
GO: 439 REAC: 12 WP: 48
Total: GO:1128 REAC:136 WP:63
The first 2000 had similar values with downregulated gene.The upregulated genes gave different results for example interferon related pathways compared to common cytoplasmic translation that came up in downregulated and the whole set. The last 2000 I assumed it would be similar to upregulated however not many pathways seemed to be in common
Do the over-representation results support conclusions or mechanism discussed in the original paper? Upregulation shows in our Gprofiler query the response to type II interferon. In our paper it mentions the “induction of interferon stimulated genes”. Also “expression of interferon-responsive genes” (Lieberman et al. 2020).
Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.
I found an article that states the SARS-CoV-2 positive thyroid tissue of patients activated Type I and Type II interferon signaling. We can see in our gprofiler query of upregulated genes that there is a significant pathway of responding to type II interferon (Poma et al. 2021).